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Abstract 
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Neutral models are foundational in the archaeological study of cultural transmis- 
' (— j ' sion. Applications have assumed that archaeological data represent synchronic 

Q_i samples, despite the accretional nature of the archaeological record. Using numer- 

ic ical simulations, I document the circumstances under which time- averaging alters 

the distribution of model predictions. Richness is inflated in long-duration assent- 
ed blages, and evenness is "flattened" compared to unaveraged samples. Tests of neu- 
trality, employed to differentiate between biased and unbiased models, suffer se- 
rious problems with Type I error under time-averaging. Estimation of population- 
^ level innovation rates, which feature in many archaeological applications, are bi- 
, ased even without time averaging, but have sharply increased bias given longer 
assemblage durations. Finally, the time scale over which time averaging alters 
predictions is determined by the mean trait lifetime, providing a way to evaluate 
CT) the impact of these effects upon archaeological samples. 

Keywords: cultural transmission, Wright-Fisher model, time averaging, neutral 
theory 
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1. Introduction 

•i-H 

^ The evolutionary study of culture today crosses many disciplines and employs 

o3 a variety of experimental and observational methods to study its subject matter. 

What makes the archaeological record unique as a source of data concerning the 
evolution of culture is time depth, creating the possibility of studying both the 
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unique histories of human groups and the evolutionary processes that shape those 
histories. Archaeology is not unique in studying temporal data on human activ- 
ity, but like our colleagues in paleobiology, we study an empirical record that is 
unlike the time-series data available to disciplines such as economics or epidemi- 
ology (e.g., Arrow, 2009; Keeling, 2005; Keeling and Rohani, 2007; Kendall and 
Hill, 1953; Rothman et al., 2008). The archaeological record is not a sample of 
measurements from individual moments in time stacked together into a sequence. 
Instead, archaeological deposits are almost always accretional palimpsests, repre- 
senting cumulative artifact discard over durations of varying length (Bailey, 2007, 
1981; Binford, 1981; Schiffer, 1987; Stein, 1987). Thus, when archaeologists 
count the richness of faunal taxa in an assemblage, or measure the relative fre- 
quencies of ceramic types, the data obtained summarize the bulk properties of 
artifact discard and deposition over significant spans of time, often with noncon- 
stant rates of accumulation. 1 We refer to assemblages which are accretional in this 
manner as "time averaged." 

A growing number of studies apply cultural transmission models to artifact as- 
semblages by comparing the predictions such models make for the richness, diver- 
sity, or frequency distribution of cultural traits, to counts or frequencies of artifact 
classes (e.g., Bentley and Shennan, 2003; Bettinger and Eerkens, 1999; Eerkens 
and Lipo, 2007; Jordan and Shennan, 2003; Lipo and Madsen, 2000; Perreault 
and Brantingham, 2011; Premo and Scholnick, 2011; Scholnick, 2010; Shennan 
and Wilkinson, 2001; Shennan, 2011; Steele et al., 2010). The question is, are 
model predictions comparable to archaeological measurements? Given the time 
averaged structure of most archaeological deposits, I suspect the answer is no. 
Transmission models developed outside archaeology are typically constructed to 
make predictions concerning variables observed at a point in time. To date, almost 
none of the archaeological literature employing cultural transmission models has 
taken this "time averaging" effect into account and modified the way predictions 
are made to match the nature of the phenomena we measure (cf. Bentley et al., 
2004). Evaluating the effects of temporal aggregation upon the predictions made 
by cultural transmission models is the first step in understanding how to rewrite 
and adapt transmission models to understand their dynamics given time averaged 
observations. 

In his dissertation, Neiman (1990) considered a potential source of time av- 
eraging effects in diachronic assemblages: variation in discard rates across traits. 



As well as the action of various post-depositional and taphonomic processes, of course. 



2 



With respect to this particular effect within accretional deposits, Neiman's re- 
sults suggested that the predictions made by a neutral model of cultural trans- 
mission were directly applicable to the relative frequencies of traits as we would 
measure them in a time averaged assemblage. Nevertheless, there is good rea- 
son to consider the effects of aggregation directly, outside of variation in discard 
rates. Paleobiologists, for example, have documented systematic differences be- 
tween living and fossil assemblages, including increased species richness, reduced 
spatiotemporal variance in taxonomic composition, and flatter species abundance 
curves in time averaged assemblages (Olszewski, 2011; Tomasovych and Kid- 
well, 2010a,b). Lyman (2003) extended these results to zooarchaeology, noting 
that time averaging can be a significant problem when the process one is applying 
or studying occurs over a shorter time scale than the empirical record available to 
study its properties (see also Grayson and Delpech, 1998). This relation between 
time scales is applicable to cultural transmission modeling as well. 

Archaeologists now employ a variety of cultural transmission models, which 
differ in the kind of variation and traits they describe and the copying rules and 
evolutionary processes they incorporate. Discrete models describe individual vari- 
ants or traits by their count or frequency in a population and are foundational for 
the study of stylistic variation in many artifact categories (e.g., pottery). The 
simplest discrete model is random copying in a well-mixed population with in- 
novation, representing neutral variation with the stochastic effects of drift. We 
frequently construct more complex models of transmission bias by adding addi- 
tional terms or frequency-dependent copying rates to the basic unbiased copying 
model (Cavalli-Sforza and Feldman, 1973a,b, 1981; Boyd and Richerson, 1985). 
Thus, an understanding of the effects of time averaging upon neutral transmission 
will be informative about many (if not all) of the discrete transmission models in 
use by archaeologists today, and forms the focus of the present study. 

I report the results of numerical simulations designed to observe neutral trans- 
mission using variables employed in the archaeological literature, aggregated over 
time at a variety of intervals designed to mimic a wide range of "assemblage du- 
rations." In Section 2 I describe the relationship between neutrality, unbiased 
copying, and the separate but related concept of "drift," followed by a review of 
the quantitative properties of the well-mixed neutral Wright-Fisher infinite-alleles 
model in Section 3. Section 4 outlines the simulation model employed to study 
time averaging in this paper, including model verification and testing, and the 
algorithm used to effect temporal aggregation within the simulations. Section 5 
presents the results of simulating unbiased cultural transmission for a variety of 
innovation rates and assemblage durations, and Section 6 summarizes the effects 
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seen and points to next steps in reformulating our cultural transmission models for 
archaeological contexts. 

2. Conceptual Structure of Neutral Cultural Transmission 

In his classic article Style and Function: A Fundamental Dichotomy, Dunnell 
(1978) proposed that many aspects of an artifact would play little or no role in its 
engineering performance, and thus have no impact on the fitness of individuals 
employing it. In other words, some attributes of artifacts are neutral with respect 
to selection. This has been widely misinterpreted as a claim that the artifacts 
themselves are neutral or have no fitness value, which is not the case. Dunnell 
was saying that if one describes an artifact solely using attributes which have equal 
cost or performance, the resulting classes meet the definition of neutral variation. 

Fraser Neiman (1990) first connected Dunnell's identification of style as se- 
lectively neutral variation, to population genetic models designed to describe ge- 
netic drift. His dissertation considers a wide range of cultural transmission mod- 
els, especially those described by Cavalli-Sforza and Feldman (1973a,b, 1981) 
and Boyd and Richerson (1985). Neiman employed simulation to calculate the 
consequences of both individual processes as well as processes combined with 
various archaeological factors such as variable rates of artifact discard. In this 
work, Neiman pioneered virtually every technique used by archaeologists today 
to model and study cultural transmission. The discipline as a whole was intro- 
duced to this work in his now classic 1995 article (Neiman, 1995), in which the 
dynamics of Woodland ceramic variation were explicitly modeled as a random 
copying process. 

Despite the fact that there are multiple ways that neutrality can arise as a pop- 
ulation level effect, there is a tendency today to equate neutrality with "drift" 
in the archaeological literature on cultural transmission. For example, Bentley 
et al. (2004, p. 1443) offer a fairly typical description of unbiased cultural trans- 
mission as "random genetic drift, which describes how the diversity of variants 
evolve when the dominant process is one of random copying." In fact, drift and 
the copying rules that create population-level trait distributions are different and 
independent aspects of a transmission system. Before we turn to the details of a 
formal model for unbiased, neutral transmission, it is worth reviewing the concep- 
tual elements that make up such models. 

Drift is a feature of any stochastic transmission model in a finite population, 
regardless of whether selection or bias is also present in the model. Sewall Wright 
gave the name "genetic drift" to the random fluctuations in gene frequency that 
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occurred because some individuals might be the source of many genes in the next 
generation, and others none at all. Translated into a cultural model, drift occurs 
when some individuals, by random chance, are imitated or copied and others are 
not. In an infinite population, by contrast, the variants held by individuals would 
be sampled at their exact frequencies in the population, and thus there would be no 
stochastic "wiggle" in trait frequencies. This is reflected in population genetics by 
the famous "Hardy-Weinberg" equilibrium, where in the absence of selection or 
other forces, gene frequencies stay the same from generation to generation. This 
means that we can easily have neutrality without drift, in an infinite population. In 
a large but still finite population, we can expect drift to have very tiny, potentially 
even unmeasurable effects upon the trajectory of trait frequencies. 

Drift, moreover, occurs in combination with a variety of inheritance rules, 
mutation models, and in combination with natural selection. In small populations, 
we can expect drift to be a factor when examining the engineering properties of 
ceramics and the relative fitness of firing technologies, or the fitness of foraging 
strategies. Whenever such traits are learned and passed on within small, finite pop- 
ulations, the stochastic aspect of who learns from whom will create fluctuations 
in variant frequencies that have nothing to do with the performance or survival 
value of traits, or the prestige of those we choose to learn from or imitate. In 
other words, we can have drift without neutrality. In small enough populations 
or during bottlenecks, even adaptive technologies and knowledge can be lost to 
drift (Henrich, 2004, 2006). We should always be on the lookout for the effects 
of drift, especially as population sizes get smaller as we go back in time. Drift is 
not a model of human social learning; it is a consequence of finite populations, 
injecting stochastic noise into the dynamics of a system that affects our ability to 
cleanly fit models and test hypotheses. 

Neutrality, by contrast, is a population level phenomenon, arising when there 
is no net force systematically favoring certain variants over others for a partic- 
ular dimension of variation. Most commonly, of course, we mean that there is 
no natural selection that favors some alleles over others, but from a mathematical 
perspective, the transmission bias rules of Cavalli-Sforza and Feldman (1981) and 
Boyd and Richerson (1985) are equivalent to selection models. 2 The simplest way 
for neutrality to arise is for individual social learning to be "unbiased." Unbiased 



2 In this paper I leave aside the relationship between "natural" and "cultural" selection, and 
transmission biases, since such issues are largely philosophical and theoretical and do not affect 
the nature of the models we employ for quantitative analysis of cultural variation. 
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transmission models always yield population-level neutrality for the traits being 
passed, because the probability of imitating any specific trait is simply propor- 
tional to its frequency among individuals in the population. The Wright-Fisher 
model is one of the earliest stochastic models in population genetics (Provine, 
1989, 2001; Wright, 1931), and was originally created to describe the process of 
genetic drift and its effects in combination with other evolutionary processes. Fol- 
lowing Kimura's theory of neutral alleles, Wright-Fisher is also used to describe 
the evolution of populations in which variants are selectively neutral. Elaborations 
of the basic Wright-Fisher model add mutation, selection, loci with multiple alle- 
les, and multiple loci with interactions between loci (see esp. Crow and Kimura, 
1970; Ewens, 2004). 3 

But unbiased copying is not the only source of neutrality among variants, and 
it is important to keep this in mind when selecting models to test as explanations 
for archaeological phenomena. In any realistic human population, there will be 
heterogeneity in social learning rules, with individuals using different rules for dif- 
ferent traits, or kinds of traits, and perhaps having individual propensities for con- 
formism (all other things being equal) or pro-novelty bias (Mesoudi and Lycett, 
2009). A population which is heterogeneous for such rules may display the char- 
acteristic frequency distributions of conformity or pro-novelty biased if we are 
able to observe small numbers of transmission events or individual transmission 
chains, while simultaneously cancelling each other out at the level of the popu- 
lation. In other words, heterogeneity is a major source of equifinality between 
different models of social learning, when observed through population-level trait 
frequencies. No archaeological applications of cultural transmission models to- 
day have employed heterogeneous models, probably because the theory behind 
such models is not well-studied. But this is clearly a frontier for future research 
since homogenous models poorly reflect what occurs in real human populations. 

Returning to unbiased models of transmission, we face a further choice in 
selecting a specific model to employ or study. In addition to the copying rules, 
we must specify an innovation rule. Such a rule answers questions like: how do 
new variants enter the population, can variants be invented multiple times inde- 
pendently, and is there a constrained range of variation for a particular dimension 



3 And, the Moran family of models mirrors the Wright-Fisher models, with overlapping genera- 
tions, by representing dynamics as continuous-time stochastic processes. Moran models are likely 
the best framework for modeling cultural transmission when the exact temporal dynamics matters. 
In this paper I follow archaeological convention by employing the more familiar Wright-Fisher 
discrete generation framework. 
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of an artifact? For example, painted design elements on a ceramic pot offer a 
"design space" of possibilities that is potentially unbounded, even if only a tiny 
fraction of possible designs occur in any archaeological context. Such attributes 
are best modeled by the "infinite allleles" innovation model. In contrast, stylistic 
aspects of lithic tools may be sharply constrained by the technology and mate- 
rials themselves, and may be best modeled by innovation among a small set of 
variants, with the material constraints causing frequent "reinvention" of the same 
shapes over and over. Such attributes are best modeled by constraining the design 
space, and employing a finite or "k-alleles" version of the unbiased model. Since 
Neiman's pioneering work, most archaeological applications of neutral models 
have employed the "infinite alleles" variant of the Wright Fisher model (WF- 
IA)(Kimura and Crow, 1964). Therefore, in the remainder of this paper, I focus 
on the unbounded model of neutral evolution with innovation, since it is relevant 
to a large number of archaeological contexts and artifact categories, but the reader 
should be aware that the models with a constrained number of variants may be 
hugely important in specific archaeological contexts, and are underexplored in the 
archaeological literature. 

3. Unbiased Transmission: The Wright-Fisher Infinite-AUeles Model 

WF-IA is a stochastic process that models unbiased transmission within a 
fixed- size population as multinomial sampling with replacement, with a mutation 
process that adds new variants to the population at a known rate. After describ- 
ing the model, I review the sampling theory of Ewens (1972), which gives the 
distribution of variants expected in small samples taken from the population as 
a whole. The sampling theory, rather than the distribution of variants in the full 
population, is both well-understood, and most relevant to archaeologists, who are 
always sampling an empirical record of past artifact variation. 

The well-mixed neutral Wright-Fisher infinite-alleles model (Kimura and Crow, 
1964) considers a single dimension of variation ("locus") at which an unlimited 
number of variants ("alleles") can occur, in a population of N individuals. 4 The 
state of the population in any generation is given in several ways: a vector repre- 



Conventionally, the model treats a diploid population, in which N individuals each have two 
chromosomes and thus there are always 2N genes tracked in the population. The haploid version 
is more appropriate for modeling cultural phenomena, and thus formulas given in this paper may 
differ from those given by Ewens (2004) and other sources by a factor of two. For example, the 
key parameter 6 is defined as 2Nfi rather than the common genetic definition 4-Nfi. 
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senting the trait possessed by each individual (census), a vector giving the abun- 
dance of each trait in the population (occupation numbers), or by the number of 
traits represented in a population by a specific count (spectrum). 

In each generation, each of N individuals selects an individual at random in 
the population (without respect to spatial or social structure, hence "well-mixed"), 
and adopts the trait that individual possessed in the previous generation. 5 Equiva- 
lently, a new set of N individuals are formed by sampling the previous generation 
with replacement. At rate \x for each individual, a new variant is added to the 
population instead of copying a random individual, leading to a population rate of 
innovations 6 = INji (Ewens, 2004), with no "back-mutation" to existing traits. 6 
An important consequence of this innovation model is that each variant is eventu- 
ally lost from the population given enough time, and replaced with new variants. 
Thus, there is no strict stationary distribution for the Markov chain describing 
WF-IA, although there is a quasi- stationary equilibrium in which the population 
displays a characteristic number of variants, with a stable frequency distribution 
governed by the value of 6 (Ewens, 2004; Watterson, 1976). 

Beginning with a now-classic paper Ewens (1972) constructed a sampling the- 
ory for the neutral WF-IA model, allowing the calculation of expected moments 
and frequency distributions for small samples (compared to overall population 
size) (see Ewens, 2004, for a complete summary of results on the sampling the- 
ory). In what follows, we assume that a neutral WF-IA process is running within a 
population of size N. At some moment in time after the population has reached its 
quasi-stationary equilibrium, we take a sample of n individuals, where the sample 
is small compared to the population size (n «: N). We then identify the variants 
held by each individual. The total number of variants seen in the sample will be 
denoted by k, or k obs depending upon context. 

Given such a sample, Ewens (1972) found that the joint distribution of the 
variant spectrum (a* represents the number of variants represented i times in a 
sample), given the population innovation rate (6), is given by the following for- 



5 An individual can select themselves at random since sampling is with replacement, and this 
would be equivalent to "keeping" one's existing trait for that generation. 

6 It is important to note that 6 is not a measure of the "diversity" of traits in the population, as it 
has been employed in several archaeological studies, but is instead a rate parameter of the model. 
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mula (now known as the Ewens Sampling Distribution): 



p j?iII (fl i ,...,fl B ) = _U_^_ (i) 

7=1 7 ' 

where # (,,) is the Pochhammer symbol or "rising factorial" 6(6+ 1)(6+2) • • ■ (6+ 
n - 1). In most empirical cases, we cannot measure (or do not set through exper- 
iment) the value of 6, so a more useful relation is the distribution of individuals 
across variants (i.e., the occupation numbers), conditional upon the number of 
variants fc obs observed in a sample of size n: 

n\ 

F(m,n 2 ,..., n k \k ohs ) = —— (2) 

\S*\k\nin 2 ■■■n k 

where \S k n \ denote the Stirling numbers of the first kind, which give the number 
of permutations of n elements into k non-empty subsets (Abramowitz and Ste- 
gun, 1965). The latter serves here as the normalization factor, giving us a proper 
probability distribution. 

From Ewens 's sampling theory, and in particular Equation 2, a number of use- 
ful measures can be derived, relevant to archaeological applications. In this study, 
I focus upon the most commonly used: statistical tests of neutrality, estimation of 
innovation rates (6), and the evenness with which variants are represented in the 
population (as revealed by several diversity measures). 

3. 1 . Statistical Tests for Neutrality 

Because Equation (2) requires no unobservable parameters, it serves as the 
basis for goodness-of-fit tests between empirical samples and the neutral WF-IA. 
The two most important such tests are the Ewens-Watterson test using the sample 
homozygosity and Slatkin's "exact" test (Durrett, 2008; Ewens, 2004; Slatkin, 
1994, 1996, 1994, 1996). 7 Both have been adopted for use by archaeologists, 
beginning with Neiman (1995) and Lipo (2001), who described Watterson's work 
in detail, and more recently, applications of Slatkin's exact test by Steele et al. 
(2010) and Premo and Scholnick (201 1). 



7 There are several other important tests of neutrality when dealing with DNA sequence data, 
including Tajima's D, the HKA test, and the McDonald-Kreitman test (Durrett, 2008). Because 
their assumptions are highly specific to the structure of sequence data, I omit consideration of them 
here. 
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The Slatkin test makes no assumptions concerning the process underlying an 
alternative hypothesis to neutrality, whereas the Ewens-Watterson test examines 
the observed heterozygosity at a locus versus the expected heterozygosity pre- 
dicted by Ewens sampling theory. Slatkin's test does not employ the concept of 
heterozygosity, and relies only upon the "shape" of the Ewens Samping Distri- 
bution given a specific innovation rate. As a result, archaeologists should prefer 
Slatkin's test for examining the fit of a synchronic sample of variants to the null 
hypothesis of neutrality. Slatkin's test is modeled upon the Fisher exact test for 
contingency tables. Where the Fisher exact test determines the probability of an 
unordered configuration from the hypergeometric distribution, Slatkin's test deter- 
mines the probability of a sample of traits (characterized by occupation numbers) 
with respect to Equation 2. 

There are two methods for determining how probable a given sample is, with 
respect to the ESD. For relatively small n and k, it is possible to enumerate all pos- 
sible combinations (C) of the n individuals among k variants. Each configuration 
(cj e C) then has a probability given Equation 2, as does the observed configura- 
tion (c obs ). With larger sample sizes and values of K obs , it becomes impractical or 
simply time consuming to enumerate all possible configurations and thus deter- 
mine the likelihood of an observed sample. In such cases, Monte Carlo sampling 
of configurations from the Ewens Sampling Distribution is used. We then de- 
termine the total probability mass of all configurations (enumerated or sampled) 
whose probability are less than or equal to the observed configuration: 

F e = J] p ( c ;'l^ ( 3 ) 

Cj€{C : P(Cj I k) < P(c I *)} 

F e then represents the Fisherian p-value of the sample with respect to the 
Ewens Sampling Formula, and thus can be interpreted as a test of the hypothesis 
that the sample was drawn from a neutral dimension of variation which followed 
the WF-IA copying model. The P e value for a given sample gives the tail prob- 
ability of its occurrence given the ESD. Thus, if we take a sample of size 100 in 
a population with innovation rate 6 = 0.1, and identify two variants with counts 
51 and 49, we might not be surprised to see a F e value of 0.01 181, indicating that 
such a sample is highly unusual for a WF-IA process. On the other hand, in the 
same sample of size 100, if we identify four variants, with counts 55, 38, 6, and 
1, this seems a much more typical result of an unbiased copying process. Indeed, 
the F e value of 0.48544 confirms that we should expect to see such samples quite 
often. 
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3.2. Estimation of Innovation Rates 

The behavior of the WF-IA neutral model is governed by the innovation rate 
(9). Recall that 9 = 2Nfi, and thus represents the population-level rate at which 
new variants enter the population. In general, for low values of the innovation 
rate (9 < 1.0), the process is "drift-dominated," and one or a small number of 
variants dominate the population. At innovation rates above 1.0, which implies 
that every single "generation" incorporates one or more new variants, the pro- 
cess is "mutation-dominated," and more variants are maintained at intermediate 
frequencies in the population. 

Thus, estimation of the innovation rate from empirical data is of great interest 
when investigating empirical cases. If we measure the number of variants (K n ) in 
a sample of artifacts of size n, the sampling theory gives the following probability 
distribution (Ewens, 2004, Eq. 3.84): 

w n = k) = 1 -^ (4) 

This is a somewhat inconvenient distribution to work with directly, since cal- 
culating the Stirling numbers and rising factorials is both analytically difficult and 
computationally expensive, but the expected value of K n has a simple form: 

9 9 9 9 

E(K n ) = - + + + • • • + (5) 

y n) 9 9+ 1 9 + 2 9 + n-\ w 

K n is the sufficient statistic for 9, containing all of the information required to 
calculate the maximum likelihood estimate of the innovation rate (9) from an em- 
pirical sample. This is done numerically by finding the value of 9 that maximizes 
the likelihood function of Equation 4, or equivalently, finding the value of 9 for 
which the expected value of K n given Equation 8 is equal to the observed num- 
ber of variants in a sample (since the full distribution may not have a closed-form 
likelihood function). In the archaeological literature, Neiman (1995) introduced 
this estimator of 9 and called it t e . With larger samples, Watterson (1975) showed 
that k I log n is a good approximation for the MLE estimator (Durrett, 2008). 

Despite the fact that this estimator (and its approximations) are the best that 
can be achieved from samples, Ewens (1972) showed that all such estimates of 9 
are biased. Simulations demonstrate, furthermore, that 9 (or t e ) is an overestimate 
of the actual value, and that the amount of bias increases with 9 itself (Ewens 
and Gillespie, 1974). In addition, the variance of the estimator is quite large, and 
decreases very slowly with increased sample size (Durrett, 2008). The situation 
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is quite different using the "infinite sites" model of neutral evolution and DNA 
sequence data, where there are excellent and nearly unbiased estimators of theta. 

But with the WF-IA and no additional structure to "traits" or alleles, it is very 
difficult to estimate the innovation rate with any accuracy, or determine whether 
two samples come from populations with the same innovation rate, or different 
rates. This fact calls into serious question the degree to which t e is useful in ar- 
chaeological analysis, either for estimating innovation rates in past populations, 
or as a measure of richness or diversity across assemblages or samples. These 
caveats apply to estimates of innovation rates and t e given synchronic samples; 
the effects of time averaging on theta estimation have not been previously docu- 
mented, and are addressed in Section 5.3. 

3.3. Diversity Measures 

The amount of variation expected in a sample is an important quantity, given 
that we would clearly expect transmission models incorporating bias terms to dif- 
fer from unbiased or neutral models (e.g. Kohler et al., 2004). Conformist trans- 
mission should result in smaller numbers of variants than expected under unbiased 
transmission, and of course anti-conformist, or "pro-novelty", biases should result 
in larger numbers of variants being maintained, on average. But beyond helping 
us assess goodness-of-fit to an unbiased copying model, comparing the number of 
variants in a sample (K„) either to a model, or between assemblages, is difficult 
without reliable estimates of the population-level innovation rate (6). Since this 
is inherently difficult and inaccurate, we might ask instead what the evenness of 
variants is across our samples, since both innovation rates and different models of 
cultural transmission have clear implications for the diversity of traits we observe. 

In the archaeological literature on cultural transmission, the most important 
evenness measure is tf, which is a summed estimate of dispersion given trait fre- 
quencies Neiman (1995): 



To make this measure easier to compare across different innovation rates, it is 
convenient to normalize. Wilcox's "index of quantitative variation," does so, and 
varies between (when all cases belong to a single category), and 1 (when all 
cases are evenly divided across categories) (Wilcox, 1973): 




(7) 
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Paleobiologists have found that fossil assemblages have considerably "flatter" 
species diversity curves compared to living communities, and I expect that time 
averaging will have the effect here of pushing IQV towards 1.0 compared to its 
value in unaveraged samples. 

4. Methods 

In this research, I employ a "forward-time" approach to computational mod- 
eling of unbiased cultural transmission, by contrast to most modeling in theoreti- 
cal population genetics today, which employs the coalescent or "backward-time" 
approach (Kingman, 1977; Durrett, 2008; Wakeley, 2008). In archaeological re- 
search, we are interested in the entire distribution of variants which transmitted 
through the population, samples of which may be deposited and become part of 
the archaeological record regardless of which variants ultimately leave descen- 
dants in later generations. Forward-time approaches evolve a population in steps, 
applying rules for the generation of variation, copying between individuals, in- 
novation, and sometimes population dynamics. 8 Several well-tested forward-time 
population genetic frameworks exist, including a very flexible framework called 
simuPOP (Peng et al., 2012; Peng and Kimmel, 2005). 

In this research, I employ a framework written by the author specifically for 
cultural transmission simulations. This project calls for integrating computa- 
tion models of archaeological classification and seriation, which require code be- 
yond that supplied by population genetics frameworks. My simulation codebase 
is called TransmissionFramework, and is available as open-source software. 9 
TransmissionFramework runs on any platform capable of supporting a Java 1 .6+ 
runtime, with optional scripts requiring Ruby 1.9+. 

4.1. Model Verification 

Simulation modeling plays an increasingly important role in scientific inquiry, 
to the extent that computational science is now recognized as a third branch of 
physics, along with the pre-existing theoretical and experimental branches (Lan- 
dau and Binder, 2005). Indeed, as theory becomes more complex and realistic, we 
often cannot directly solve theoretical models and derive predictions that should 



Forward-time approaches are not necessarily equivalent to "agent-based models," but ABM 
techniques are useful in implementing forward-time models. 

9 TransmissionFramework can be downloaded or the code examined at http : //github . 
com/ mmadsen/Tr ansmi s s i onFr amework. 
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be measurable by experiment. Computational science sits between theory and 
experiment, allowing us to understand the behavior and dynamics of complex 
theoretical models, and calculate predictions that can be used for experiment or 
hypothesis testing. 

The problem of assessing simulation model quality is important enough that 
the Department of Energy and the Air Force Office of Scientific Research re- 
quested that the National Research Council study the foundations of verification, 
validation, and uncertainty quantification (VVUQ) activities for computational 
models in science and engineering. Their draft report forms the basis of my ap- 
proach to verification and uncertainty analysis in this research (Committee on 
Mathematical Foundations of Verification Validation and Uncertainty Quantifica- 
tion, National Research Council, 2012). 

Verification answers the question, "how accurately does a computational model 
solve the underlying equations of a theory for the observable quantities of inter- 
est." Given that we know the true value of 9 which drives our simulation runs, it is 
possible to calculate the expected number of variants at stationarity, and use this 
to verify that TransmissionFramework is correctly implementing the WF-IA. 
The expected number of traits is a good validation estimate because the number 
of variants present in a sample will be sensitive to the relative rates of copying 
and innovation events being handed correctly in the simulation code. Errors in 
handling these events in software will be magnified across many individuals over 
many simulation steps. 

Since 9 is known, the mean value of K„ is well approximated by: 

E e (K n )= f {\-{l-x) n )-{\-x) e ~ l dx (8) 
Jo x 

Using Equation (8), I compared expected K n to the average of k n for a large 
sample of simulation runs. To ensure that behavior is correct across a range of 
useful 9 values, I performed multiple simulation runs at 9 values ranging from 2 
to 40, for 5000 generations in a simulated population of 2000 individuals. Each 
parameter combination was represented by 3 simulation runs. The initial transient 
behavior of the model is discarded from data analysis by skipping the first 750 
generations, given the mixing time analysis by Watkins (2010). At each time step 
in a simulation run, the simulator took a sample of 30 individuals and tabulated 
the traits held by those individuals, and recorded the value of K n . This yielded 
408,478 samples across validation runs. 

Using Mathematica 8.0 with MathStatica 2.5 installed, I calculated expected 
values for each 9 level used in simulation, employing Equation (8). Table 2 com- 
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pares the expected and observed values. In all cases, the analytical results are 
extremely close to the observed mean K n values from simulation, and certainly 
well within 1 standard deviation. Thus, I conclude that the TransmissionFrame- 
work implementation of WF-IA employed in this study accurately represents the 
desired theoretical model. 

4.2. Time -Averaging and Simulation Parameter Space 

Time- averaging was modeled in TransmissionFramework by implementing 
a series of statistical "windows" within which trait counts were accumulated be- 
tween time steps. At the end of each temporal window, a sample of appropriate 
size is taken from the accumulation of trait occurrences, trait counts within that 
sample tabulated, and K n values recorded. The simulator architecture allows an 
arbitrary number of temporal windows to be employed simultaneously (albeit with 
a small performance penalty for each window). As a consequence, during a sin- 
gle simulation run, the simulator tracks both unaveraged statistics and the same 
statistics averaged over any number of "assemblage durations." All trait samples 
taken in the simulator, whether unaveraged or for a specific assemblage duration, 
were also recorded to allow calculation of Slatkin's Exact test. Additionally, to 
facilitate analysis of time scales within the simulation model, for each trait the 
interval between entry and loss through drift was recorded. In the simulation re- 
sults reported here, trait samples were of uniform size 100. Constant sample size 
removes the effect of different sample sizes on the reported results, although the 
interaction of the fixed sample size and the innovation rate will lead to cutoff be- 
havior at very high 6 values. This is acceptable since the very highest 6 values 
employed here are unrealistic for almost any prehistoric phenomena, and may be 
approached only for "viral" behavior on modern social networks. 

All simulations reported here were performed with a population size (N) of 
2000 individuals, and simulation runs were conducted for the following values 
of 0: 0.1, 0.25, 0.5, 1.0, 2.0, 5.0, and 10-100 at intervals of 10. This range en- 
compasses innovation rates that are very small, through populations in which a 
full 5% of the population has a never-before-seen variant each generation. Sim- 
ulations were performed in several batches, with a core set of runs performed for 
40,000 steps in order to determine the effects of long-duration time averaging, 
yielding simulated assemblages at a variety of windows ranging from 3 steps to 
8000 steps (the exact durations sampled are given in the first column of Table 1). 
In order to increase the sample size of long-duration assemblages, a second set of 
simulation runs using the same parameters were done with only the five largest 
windows recorded (the short duration window sizes were discarded to avoid a 
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flood of raw data beyond that needed for stable statistical analysis). Finally, since 
the statistical behavior of the process at very small values of 6 is highly variable, 
a third set of runs was performed to increase the number of samples for 9 values 
between 0.1 and 1.0. 

Trait samples were post-processed outside the simulator environment, since 
calculation of Slatkin Exact tests within the simulator itself would slow down the 
primary simulation model by a large factor. Montgomery Slatkin's original C lan- 
guage program was used in Monte Carlo mode to produce an estimate of F(E) for 
each sample of individuals. I modified Slatkin's original montecarlo . c program 
to not require the data to be embedded in the source code, instead taking data as 
a command line parameter, and outputting only the F(E) value and 9 estimate, to 
allow easy post-processing of the simulation output. 10 

The simulation results reported here, once post-processed, comprise 3,024,085 
sample values for K„, across the 9 values listed above, and broken down across 
assemblage durations as in Table 1, and 1,113,134 Slatkin Exact test results for 
the same combinations of 9 and assemblage duration. 

5. Results 

Simple inspection of the relationship between assemblage "duration" (i.e., ac- 
cumulation interval) and the average number of variants (K n ) in a sample of size 
100, shows a strong time averaging effect (Figure l). 11 Temporal aggregation of 
the results of transmission inflates the number of variants we see in a sample, 
with greater effect as the population innovation rate (9) increases. The effect is 
very small at low theta values (i.e., when the process is drift-dominated, 9 < 1.0) 
and requires long accumulation of copying events to have a measurable effect 
upon mean K n . Conversely, inflated K n appears at fairly short duration as theta 
increases. 

Simulation steps (or "generations") represent an arbitrary time scale with re- 
spect to the chronological time archaeologists can (with effort) measure. In or- 
der to understand the effects of time averaging on archaeologically-relevant time 
scales, it will be useful to rescale simulation time by some factor which is observ- 



These modifications are available, along with all other analysis scripts, in the Github reposi- 
tories http://github.com/mmadsen/saa2012, and the TransmissionFramework source code. 

"Here, the time axis represents raw simulation steps, each of which represents N = 2000 copy- 
ing events within the population. This is the only figure in this paper which uses raw simulation 
time steps as the time variable. 
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able as a function of artifact class duration in the depositional record. I take up this 
issue further in Section 6, but the ideal time scale would be the mean duration of 
artifact classes in the classification system being used in a given empirical study. 
I do not explicitly model archaeological classification in the present results, but a 
related measure is the lifespan of the traits being transmitted within the simulated 
population. 

5.1. Time Scales and Time averaging 

The "mean trait lifetime" in WF-IA is a direct consequence of the balance be- 
tween innovation and loss of traits to drift, in a fixed-size population. At the quasi- 
stationary state, the population will fluctuate around a mean number of traits, as 
individual traits enter and leave the population constantly. This implies that at 
stationarity, if we add traits at a higher rate due to migration or innovation, more 
traits must be lost to drift each generation. WF-IA thus satisfies a balance equation 
characterizing the average number of variants (n)(Ewens, 1964): 



where t represents the average number of generations that a new trait lasts in the 
population before its loss to drift (i.e., the mean trait lifetime). 

An exact expression for mean trait lifetime has not been derived from the 
transition probabilities of the WF-IA Markov chain (Ewens, 1964), but it can 
be approximated by summing the average amount of time that a trait within a 
population spends at specific frequencies (i.e., mean sojourn times). Ewens (2004, 
Eq. 3.20) gives the following approximation: 



Since 9 is in the denominator of the summation, increasing the population rate 
of innovation reduces the mean trait lifetime by decreasing the amount of time 
any specific trait spends at a given frequency, and thus the total amount of time a 
trait spends in the population before being lost to drift. 

Table 3 lists the observed mean lifetime of traits for each level of 6 employed in 
this study, and the expected value as calculated using Equation 10. The observed 
values are systematically lower than the expected values, which reflects slightly 
faster loss of traits due to drift in a finite and small population compared to the 
large populations often studied in population genetics (Ewens, 1964; Kimura and 




(10) 
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Crow, 1964). Examination of Figure 1 appears to show that the onset of time 
averaging effects, however small, occurs around the time scale of the mean trait 
lifetime, for values of 9 > 1.0. This outcome is sensible given the enhanced 
probability of longer duration samples incorporating new variants in the sample 
due to innovation. In the analyses to follow, I scale the time variable by the mean 
trait lifetime, displaying assemblage duration as a multiple of this value. Thus, 
for the remainder of this paper, a scaled assemblage duration of 100 will indicate 
100 times the mean trait lifetime at that specific 9 value. For example, if we are 
examining results at 9 = 5.0, a scaled duration of 100 would indicate 12.43* 100 = 
1243 simulation steps. 

5.2. Neutrality Testing 

The Slatkin Exact test for neutrality, discussed in Section 3.1, determines the 
"tail" probability for a sample of size n, with observed number of traits k, to be 
derived from the Ewens Sampling Formula (Equation 2). The test employed in 
this study is Slatkin's Monte Carlo version, which allows the use of larger sample 
sizes, using random selection to create unlabeled configurations from the ESD to 
compare against the observed values. The resulting tail probability is converted 
into a standard hypothesis test by selecting an a value. For purposes of this study, 
I considered the upper and lower 5% of tail probabilities to indicate that a sample 
was probably not derived from a neutral transmission model, leading to a = 0.10. 

Given this a level, we should expect roughly 5% of the samples taken from 
a pure neutral copying process to fall into each of the the upper and lower tail 
regions, and thus for a Slatkin Exact test to reject the null hypothesis of neutrality. 
Roughly 90% of the samples we take from the neutral WF-IA process should fall 
between 0.05 < p < 0.95 and thus lead to acceptance of the null hypothesis. This 
experimental setup also implies the limited utility of performing a single neutrality 
test on a single sample of class counts or frequencies, as has been archaeological 
practice by necessity. A single Slatkin exact test with P e value of, say, 0.96, would 
constitute some, but relatively weak, evidence of non-neutrality. Better practice 
would be taking many samples from a large assemblage or multiple collections 
and calculating independent Slatkin tests for each sample, and examining their 
distribution. 

If time averaging has no effect on the validity of the Slatkin Exact test em- 
ployed against temporally aggregated samples, we would expect the fraction of 
samples in the two tails (upper and lower 5% in this case) to equal 10%. Any- 
thing over 10% would constitute evidence of extra Type I errors, since we know 
the samples to have been generated by a process meeting the definition of the 
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null hypothesis. Therefore, after post-processing the simulation output to produce 
Slatkin tests as described in Section 4.2, 1 tabulated the fraction of Slatkin Exact 
tail probabilities that exceeded the expected 10% tail population. These are, in 
other words, "excess" failures of the Slatkin Exact test, beyond those expected by 
the probability distribution itself. For each 6 level, and for each time averaging 
duration, the mean "excess" failure rate was computed, from the 1,113,134 raw 
Slatkin Exact test results generated in the simulation study. 

Figure 2 depicts the relationship between the excess failure rate, and time av- 
eraging duration scaled by the mean trait lifetime (as previously described). The 
mean trait lifetime is indicated by a vertical red bar in each graph. Three major 
results are apparent. First, at values of 6 > 1.0, the excess failure rate in non-time- 
averaged data is zero, as one would expect, and then begins to increase (albeit 
slowly) as the time averaging duration of samples exceeds the mean trait lifetime. 
In some cases, such as 9 = 5.0, the Slatkin Exact test continues to be accurate 
given the chosen a value through samples which are aggregated for 10 times the 
mean trait lifetime. But in all cases, with sufficient time averaging, the Slatkin 
Exact test begins to suffer from increased Type I error, reporting an ever increas- 
ing fraction of samples as not derived from a neutral transmission process. The 
extreme situation is seen at very high rates of innovation, where nearly every test 
fails, at high levels of time averaging. These failures are caused by saturation of a 
finite sample with singleton traits, causing the sample to display too much even- 
ness in frequency to have been the result of sampling from the Ewens Sampling 
Formula. But unrelated to this saturation effect, there is considerable failure in 
employing the Slatkin Exact test to detect neutrality. For example, at 6 = 5.0, at 
100 times the mean trait lifetime, approximately 70% of all samples appear in the 
tail region of the distribution, compared to the expected 10%. Clearly, the Slatkin 
Exact test is not robust for long-duration assemblages. 

Second, at low 8 values, the test results show excessive Type I error, even 
without time averaging. There are several potential causes. It is possible that the 
WF-IA process had not reached quasi- stationarity by 750 time steps, when sam- 
pling began. This would mean that the effects of initial trait assignment might still 
be present and skewing the frequency distribution of traits. Second, the Slatkin 
test is sensitive to the number of rare or singleton traits given the sample size, and 
in a small population (2000 individuals) with a low innovation rate (e.g., 6 = 0.1), 
counts of rare traits could be unstable. This would not typically be the case in 
samples from large populations or entire species. I do not consider the cause of 
this anomaly further in this paper, but it warrants further simulation study. 

In general, with long-duration assemblages, archaeologists should be care- 
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ful interpreting the results of neutrality tests adopted from population genetics. 
The effect seen here can be summarized as: with significant time averaging, trait 
frequencies generated by unbiased cultural transmission can falsely appear to be 
non-neutral and thus driven by bias or selection (Type 1 error). The longer the 
duration of an assemblage with respect to the mean trait lifetime, the larger the 
probability of a Type 1 error. With sufficient duration, in fact, the probability of 
a Type 1 error becomes virtually certain, and the Slatkin Exact test loses any dis- 
criminatory power. In summary, if one were to employ Slatkin's test to examine 
the hypothesis of neutrality in long-duration archaeological deposits, one would 
overwhelmingly come away with the impression that most cultural transmission 
was biased, either towards conformity or a pro-novelty bias - regardless of the 
underlying process occurring during prehistory. 

5.3. Theta Estimation and Innovation Rates 

There would be considerable value in estimating the population-level inno- 
vation rate (9) from sample data if it could be done accurately. As discussed in 
Section 3.2 above, such estimates are usually biased and have large variance. In 
this section, I examine the effects of time averaging upon theta estimates gener- 
ated from the samples taken to perform neutrality tests in the previous section. 
For each of the 1.1 million samples of variants (distributed across actual theta 
values and assemblage durations), I calculated theta estimates given Watterson's 
approximation (Durrett, 2008): 

e*j^- (ii) 

log n 

For each combination of actual theta and assemblage duration, theta estimates 
were averaged, to give a mean estimated theta value (E(0)), and its standard devia- 
tion. The results are shown in Table 4. There are two regions of behavior apparent 
in the table, corresponding to drift- versus innovation-dominated dynamics. At 
and below 9 = 1.0, estimated values are higher than the actual 9 used to gener- 
ate samples, and above 1.0, theta estimates begin to systematically lag below the 
actual theta value. Overestimation at 9 < 1 .0 matches the simulation results by 
Ewens (1974), although the authors did not simulate innovation rates above 2.0 
(a large value in most genetic situations). In addition to being biased, theta esti- 
mation appears to be even approximately accurate only within a narrow range of 
values around 9 = 1.0. 

Figure 3 examines estimates of theta by time averaging duration scaled by the 
mean trait lifetime, for each level of actual 9 used in the simulation runs. The 
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pattern evident in synchronic or unaveraged samples carries over to time averaged 
assemblages: below 6 < 1.0, theta estimates are larger than the actual values, and 
increase in a non-linear fashion with assemblage duration. Above 1 .0 but below 
about 30.0, theta estimates begin below the actual value, cross the actual value, 
and continue to accumulate as assemblage duration increases. Finally, at the very 
highest innovation rates, in a sample size 100, theta estimates are always drastic 
underestimates of the actual value, even with long assemblage duration increasing 
the accumulation of traits. 

The Slatkin Exact test software also provides an estimate of 6, finding the 
maximum likelihood value of theta when K n is set in Equation 8 to equal the ob- 
served value (this is the t e statistic introduced to archaeological usage in Neiman, 
1995). Figure 4 depicts the Slatkin theta estimates by time averaging duration 
scaled by the mean trait lifetime, for each level of actual 6 used in the simulation 
runs. One interesting difference between Figure 3 and the Slatkin theta estimates 
is that the latter are more accurate for actual 6 > 1 .0 than the Watterson approx- 
imation, in unaveraged assemblages. Unfortunately, with increased assemblage 
duration, estimates explode to much larger values than those calculated by the 
Watterson approximation (i.e., 6 ~ 1500 for true 6 = 30 at maximum assemblage 
duration of 1000 times the mean trait lifetime, compared to the underestimate of 
approximately 22 in Figure 3). 

In short, estimation of population-level innovation rates from samples of arti- 
facts using either estimation method are inaccurate, and the time averaging effect 
of accretional deposition renders such estimates even more inaccurate. Clearly, 
such values cannot be used as actual indications of innovation rate or to "work 
backward" towards past population sizes. And without fairly precise control over 
assemblage duration, the use of t e as a relative diversity measure between assem- 
blages (in the manner common to archaeological applications) is highly suspect. 
In the next section, I turn to tf, the other common diversity measure in archae- 
ological studies, which does not require an estimate of 9, employing instead the 
variant frequencies directly. 

5.4. Diversity Measures 

Much of the current effort in distinguishing biased and unbiased transmission 
models rely upon trait evenness and the shape of frequency distributions, given 
Alex Bentley's application of power-law distributions to both ancient and contem- 
porary data sets (Bentley and Shennan, 2003; Bentley et al., 2004, 2007; Bentley, 
2007; Bentley et al., 2009; Hahn and Bentley, 2003; Herzog et al., 2004). One 
of the ways that unbiased and "conformist" models of cultural transmission differ 
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is in the expected amount of variation. Compared to unbiased transmission, con- 
formism of even a mild degree tends to strongly concentrate adoption onto a very 
small number of traits (Mesoudi and Lycett, 2009). 12 It is difficult, however, to 
interpret the absolute number of traits (K n ) without knowledge of the population 
size, so Kohler et al. (2004) employed diversity measures instead in his classic 
examination of conformist transmission in Southwest pottery. 

The most commonly used measure in the archaeological literature on cultural 
transmission is tf (Equation 6), since it is related to Wright's original measures of 
heterozygosity and thus associated directly with the historical development of the 
Wright-Fisher model. But it is useful to normalize the results of tf between and 
1 so that we can compare different levels of theta and assemblage durations easily, 
in the same way that statisticians occasionally employ coefficients of variation or 
normalize covariances into correlation coefficients. Equation 7 does exactly this, 
and is called the "index of qualitative variation" (IQV) (Wilcox, 1973). 

Figure 5 displays the relationship between the IQV for samples of size 100, 
and time averaging duration scaled by mean trait lifetime, as before. IQV values 
range from 0.0, if only a single trait occurred within a sample (which happens 
in simulations with very low innovation rates), through 1.0, which indicates that 
traits are perfectly evenly distributed within a sample. Even at the highest in- 
novation rate studied, values of 1.0 were not seen in unaveraged samples from 
the simulation runs. It is apparent that time averaging can yield greater evenness 
among trait frequencies, although the plateau in IQV values seen at high 6 and 
high assemblage duration is a function of the saturation of K n in a finite sample 
seen above. At very low innovation rates (6 «; 1 .0), time averaging in contrast 
seems to have little effect on the dispersion of trait frequencies, with one or a very 
few traits always dominating a sample. 

In between, when innovation rates are sufficient to guarantee at least one in- 
novation on average per model generation (6 = 1.0) but fewer than 10, there is 
non-monotonic behavior apparent in the IQV index. For example, at 6 = 2.0, time 
averaging has no effect on IQV until duration is 10 times the mean trait lifetime 
(t), at which point assemblages begin to appear less even in frequency distribution, 
until about 100 times the mean trait lifetime, when evenness begins to steadily in- 



This is especially the case when conformist transmission is implemented in simulations as a 
"global" rule where only the most common trait is copied during "conformist" copying events, 
rather than weighting all traits by their relative popularity. Very little work has been done to 
compare the results from different methods of simulating biased transmission models. This is a 
topic which would benefit greatly from additional research. 
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crease. This effect is interesting, since it suggests that we cannot easily compare 
diversity indices between assemblages unless we control for duration or have in- 
dependent evidence concerning innovation rates. 

6. Discussion and Conclusions 

When we examine the effects of time averaging on the sample properties of 
unbiased transmission, using the mean lifetime of traits as our fundamental time 
scale, several lessons for practical applications emerge. First, it appears that as- 
semblages with very small amounts of temporal aggregation display little of the 
distributional alterations that characterize long-duration assemblages. Statistical 
tests of neutrality and diversity measures, and thus arguments based on them, can 
probably be used with care. Second, estimates of population-level innovation rate 
derived from Ewens's sampling formula are biased (and therefore inaccurate), 
and become seriously inaccurate with increased assemblage duration. Archaeol- 
ogists should strongly reconsider using t e or other theta estimates even in relative 
comparisons, and should definitely not consider such estimates to reflect the in- 
novation rate or population size present in the prehistoric population. Third, for 
assemblages that have a duration longer than the mean trait lifetime, it is important 
to measure and control for the relative duration of assemblages when comparing 
statistical results across samples. Without doing so, we cannot interpret relative 
differences of diversity indices or trait richness values as indicative of different 
modes of transmission. 

One caveat to the above is that such effects refer specifically to assemblage 
level data, composed of many artifacts deposited over time. Artifact- scale analy- 
sis, where the attributes under analysis come together in a short period of time, and 
where single artifacts comprise the counting unit for transmission studies, will not 
necessarily suffer the quantitative effects described here, or would suffer no mea- 
surable time averaging effects if the assemblage durations were short compared to 
the lifetime of traits. A good example of this is Jonathan Scholnick's chapter in 
the present issue, expanding on his previous research into cultural transmission in 
Colonial America through gravestones (Premo and Scholnick, 2011; Scholnick, 
2010), where his samples cover 10 year periods based on the death dates carved 
on each stone. 

Furthermore, while the mean lifetime of transmitted information plays a cen- 
tral role in establishing a "natural" time scale over which time averaging affects 
unbiased transmission, this time scale is not an archaeological one. This dis- 
crepancy in time scales arises because the abstract "traits" of our models are not 
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equivalent to the classification units employed by archaeologists. This is not a 
trivial difference, and is one that is rarely even discussed in archaeological appli- 
cations of cultural transmission models. Instead, we frequently act as if "traits 
equal types," despite occasional acknowledgement of the difference. 

But we have no direct empirical access to the information prehistoric pop- 
ulations were learning, teaching, and imitating. We will never find "units of 
transmission" in any empirical sense for archaeological applications of cultural 
transmission models, and we have no warrant to equate our models of prehistoric 
information flow with the classes we use to observe it today. Long ago, Osgood 
(1951) recognized that when anthropologists study the ideas held within a social 
group under study, what is actually being studied are the ideas we construct about 
the ideas individuals in other cultures may have had. Dunnell (1971) systematized 
this distinction, pointing out that we always operate with analytic classes whose 
construction is done by archaeologists, for archaeological purposes. These classes 
serve as a "filter" by which we detect patterns in artifact assemblages, which re- 
flect patterns in the information which flowed within past populations. There is no 
"natural" set of classes to employ in studying cultural transmission, but we often 
forget to incorporate this fact into our analyses. Linking the time scale over which 
variation entered and left a prehistoric population, and the time scale over which 
archaeological classes appear and then exit the archaeological record will involve 
further research on the relationship between transmission dynamics and complex, 
multi-dimensional archaeological classes. Such research is essential to connect 
the abstract quantities described by theoretical models, to observable aspects of 
the archaeological record. 

These results paint a fairly gloomy picture of almost all of the standard vari- 
ables archaeologists have used since Neiman's (1995) pioneering work. One won- 
ders why empirical studies using diversity measures, innovation rate estimates, or 
neutrality tests appear to "work" and give sensible results? One possibility, of 
course, is that some studies don't yield the expected results. We see this, possibly, 
in a fascinating analysis by Steele et al. (2010). The authors employed ceramic 
classes that appeared to be non-neutral and subject to selection or biased transmis- 
sion. Yet Slatkin exact tests were unable to rule out the null hypothesis of neutral- 
ity. I do not present an analysis of conformist transmission under time averaging 
in this article, but using TransmissionFramework I see evidence that temporal 
aggregation has the opposite effect on Slatkin exact tests in populations with weak 
conformist biases: neutrality tests suffer increased Type II error, making it more 
likely that we will accept a null hypothesis of neutrality when the opposite is the 
case. 
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Another possibility is that certain variables may retain their distributional char- 
acter, but have their values inflated by temporal aggregation. In such situations, 
there would be no reason to reject the neutral model, but inferences about the val- 
ues of parameters would be inaccurate. Even if investigators did not rely upon 
the absolute value of parameters, frequently such inferences (e.g., diversity val- 
ues) are employed as relative comparisons between assemblages. I suspect that 
this has occurred in a number of published studies, but few cultural transmission 
applications include detailed information concerning assemblage duration, so it 
is difficult to redo the researcher's original hypothesis tests with temporal con- 
trols, without going back to original field information or reports. Clearly, both 
possibilities may also occur in some situations. 

As archaeological usage of cultural transmission theory becomes more fre- 
quent and we move from proof-of-concept studies to demanding interpretive ac- 
curacy from our models, methodological research is essential to ensure that our 
applications are empirically and dynamically sufficient. The present study focused 
on a necessary first step in such method development, developing an understand- 
ing of the effect of time averaging in accretional assemblages upon the observable 
variables in neutral cultural transmission models. The results demonstrate that 
frequently employed statistics, such as t e , are highly inaccurate and biased when 
measured in time averaged assemblages, and that neutrality tests are subject to 
enough additional Type I or Type II error that the results can be systematically 
misleading. Clearly, in order to apply cultural transmission models to diachronic 
data derived from time averaged assemblages, we need to develop observational 
tools and methods suited specifically to the archaeological record, instead of sim- 
ply borrowing statistical methods and models from theoretical population biology. 
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Figure 1: Mean value of K„ for time averaged samples, plotted against assemblage duration in 
simulation steps, for each level of 6 in the study. Note that the "onset" of time averaging effects 
(as measured by increased K„), is quite gradual at low 0, while high innovation rates display 
increased richness with very minor amounts of time averaging. 
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Figure 2: Slatkin Exact test failure rate (above the expected 10% given two-tailed test with a = 
0.10, plotted against time averaging duration scaled by mean trait lifetime, for each level of in 
the simulation study. The red vertical line indicates the mean trait lifetime for that 6 value, and 
the shaded region encompasses the standard error of the estimates for mean failure rates at each 
duration. 
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Figure 3: Estimates of mean population innovation rate (E(ff)) from samples (n = 100) taken 
for neutrality tests, using the approximation by Watterson (1975). Plotted against assemblage 
duration, for each level of actual innovation rate used in simulation runs. 
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Figure 4: Estimates of mean population innovation rate (E(0)) from samples (n = 100) taken for 
neutrality tests, using results from Montgomery Slatkin's neutrality test software. Plotted against 
assemblage duration, for each level of actual innovation rate used in simulation runs. 
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Figure 5: IQV diversity index, derived from samples of size 100, plotted against time averaging 
duration scaled by mean trait lifetime, for each level of 6 in the simulation study. The red vertical 
line indicates the mean trait lifetime for that 9 value. 



TA Duration 


Min Sample Size 


Max Sample Size 


1 


130494 


247491 


3 


4497 


43494 


7 


1926 


18639 


15 


897 


8694 


Ji 


A 7< 




62 


216 


2103 


125 


105 


1038 


250 


516 


981 


500 


255 


486 


1000 


114 


228 


2000 


57 


114 


4000 


27 


54 


8000 


12 


16 



Table 1 : Breakdown of sample sizes for analysis of trait richness (K n ), by size of time-averaging 
"window." Some values of 9 required larger numbers of simulation runs to achieve stable result, 
thus the difference between samples sizes at the same TA duration. 
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Theta 


E(K n ) 


Simulated K n 


Sim. Stdev K n 


2 


6.054 


6.511 


1.838 


4 


9.022 


8.991 


2.269 


8 


12.869 


12.616 


2.464 


12 


15.397 


15.306 


2.571 


16 


17.228 


17.187 


2.569 


20 


18.629 


18.737 


2.486 


40 


22.601 


22.693 


2.253 



Table 2: Comparison of expected K n from (8) with simulated values from WF-IA model, for 6 
values from 2 to 40. Total sample size across 6 values is 408,478 samples of size 30. 
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Theta 


Mean Trait Lifetime 


E(ti) 


0.10 


36.54 


36.89 


0.25 


25.61 


24.05 


0.50 


21.10 


19.97 


1.00 


17.31 


17.21 


2.00 


14.57 


15.21 


5.00 


12.43 


13.05 


10.00 


10.83 


11.57 


20.00 


9.50 


10.16 


30.00 


8.68 


9.36 


40.00 


8.12 


8.79 


50.00 


7.72 


8.36 


60.00 


7.36 


8.01 


70.00 


7.08 


7.72 


80.00 


6.83 


7.46 


90.00 


6.60 


7.42 


100.00 


6.40 


7.05 



Table 3: Mean lifetime (in model generations) of traits, by 9, along with analytical approximation 
from Equation 10. 
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#0 


E(0) 


aid) 


0.10 


0.36 


0.21 


0.25 


0.50 


0.26 


0.50 


0.76 


0.33 


1.00 


1.17 


0.42 


2.00 


1.85 


0.51 


5.00 


3.49 


0.67 


10.00 


5.23 


0.87 


zU.UU 




(J.yj 


30.00 


9.70 


0.99 


40.00 


10.99 


0.99 


50.00 


12.19 


1.00 


60.00 


12.94 


1.01 


70.00 


13.76 


0.97 


80.00 


14.32 


0.98 


90.00 


14.85 


0.94 


100.00 


15.27 


0.95 



Table 4: Mean Estimated Theta (E(6)) from Samples (n=100) compared to actual values employed 
in simulation models (6q), without any time-averaging. 
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